Creating a random sample from a pandas DataFrame

Overview:

  • A sample is a subset of a dataset. The whole dataset is called the population.

  • Surveying or testing an entire population is often difficult and inefficient. Sampling instead draws a subset on which the tests or surveys are conducted, and the results are used to derive inferences about the population.

  • If every member of the population has an equal probability of being selected, and the members are selected at random, the process is called Uniform Random Sampling.

  • If some items are assigned weights that make them more or less likely to be selected than they would be under uniform probability, the process is called Weighted Random Sampling.

  • The pandas DataFrame class provides the method sample() that returns a random sample from the DataFrame.
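The difference between uniform and weighted sampling can be seen by drawing many rows and comparing how often each row appears. The following is a minimal sketch, not a definitive implementation: the four-row population, the weights, and the number of draws are hypothetical, and replace=True (sampling with replacement) is used only so that more rows can be drawn than the population holds.

```python
import pandas as pds

# Hypothetical four-row population, used only for illustration
population = pds.DataFrame({"Item": ["a", "b", "c", "d"]})

draws = 10_000

# Uniform random sampling: no weights, so every row is equally likely.
# replace=True allows drawing more rows than the population holds.
uniform = population.sample(n=draws, replace=True, random_state=0)
uniformShare = uniform["Item"].value_counts(normalize=True).sort_index()

# Weighted random sampling: row "d" is four times as likely as row "a"
weights = [0.1, 0.2, 0.3, 0.4]
weighted = population.sample(n=draws, replace=True,
                             weights=weights, random_state=0)
weightedShare = weighted["Item"].value_counts(normalize=True).sort_index()

print("Share of draws per item - uniform:")
print(uniformShare)
print("Share of draws per item - weighted:")
print(weightedShare)
```

With this many draws, the uniform shares should each land close to 0.25, while the weighted shares should land close to the assigned weights.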

Example 1 - Explicitly specify the sample size:

# Example Python program that creates a random sample
# from a pandas DataFrame
import pandas as pds

# Age vs call duration
callTimes = {"Age": [20, 25, 31, 37, 43, 44, 52, 58, 64, 68, 70, 77, 82, 86, 91, 96],
             "Call Duration": [17, 25, 10, 15, 5, 7, 15, 25, 30, 35, 10, 15, 12, 14, 20, 12]}
dataFrame = pds.DataFrame(data=callTimes)

# random_state seeds the random number generator so that
# the same sample is drawn on every run
sampleData = dataFrame.sample(n=5, random_state=5)
print("Random sample:")
print(sampleData)

Output:

Random sample:
    Age  Call Duration
5    44              7
1    25             25
7    58             25
2    31             10
10   70             10
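The effect of random_state can be checked by drawing the sample twice with the same seed. The sketch below uses a shortened copy of the data from Example 1; the variable names first and second are chosen here for illustration.

```python
import pandas as pds

# Shortened copy of the Example 1 data, used only for illustration
dataFrame = pds.DataFrame({"Age": [20, 25, 31, 37, 43, 44, 52, 58],
                           "Call Duration": [17, 25, 10, 15, 5, 7, 15, 25]})

# Two samples drawn with the same seed
first = dataFrame.sample(n=5, random_state=5)
second = dataFrame.sample(n=5, random_state=5)

# The same seed yields the same rows in the same order
print(first.equals(second))

# By default sampling is without replacement, so no row repeats
print(first.index.is_unique)
```

Omitting random_state would let the two samples differ from each other and from one run to the next.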

Example 2 - Specify the sample size as a fraction of the population size:

# Example Python program that samples
# a DataFrame, specifying the sample
# size as a proportion of the DataFrame size

# Uses the FiveThirtyEight Comic Characters Dataset
# from Kaggle under the license - CC0: Public Domain
import pandas as pds

comicData = "/data/dc-wikia-data.csv"
comicDataLoaded = pds.read_csv(comicData)
print("(Rows, Columns) - Population:")
print(comicDataLoaded.shape)

# Sample size as 1% of the population
# (no random_state is given, so each run draws a different sample)
sampleCharacters = comicDataLoaded.sample(frac=0.01)
print("Sample:")
print(sampleCharacters)

Output:

(Rows, Columns) - Population:
(6896, 13)
Sample:
      page_id  ...    YEAR
2952    57836  ...  1998.0
5628   183803  ...  2010.0
1499   137474  ...  1992.0
6042   191975  ...  1997.0
5597   206663  ...  2010.0
...       ...  ...     ...
1174    15721  ...  1955.0
1267   161066  ...  2009.0
851    128698  ...  1965.0
4693   153914  ...  1988.0
3188    93393  ...  2006.0

[69 rows x 13 columns]
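The 69 rows in the output above are consistent with 1% of 6896 (68.96) being rounded to the nearest whole number of rows. The sketch below reproduces the row count without the CSV file; the single-column DataFrame and the row_id column are hypothetical stand-ins that only match the dataset in number of rows.

```python
import pandas as pds

# Synthetic stand-in with the same number of rows as the comics dataset
population = pds.DataFrame({"row_id": range(6896)})

# Sample size as 1% of the population
sample = population.sample(frac=0.01)

print("Rows in population:", len(population))
print("Rows in sample:", len(sample))
```

Because n and frac both set the sample size, only one of them should be passed in a single call to sample().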

Example 3 - Random sampling using weights: 

# Example Python program that creates a random sample
# from a population using weighted probabilities
import pandas as pds

# TimeToReach vs distance
time2reach = {"Distance": [10, 15, 20, 25, 30, 35, 40, 45, 50, 55],
              "TimeToReach": [15, 20, 25, 30, 40, 45, 50, 60, 65, 70]}

dataFrame = pds.DataFrame(data=time2reach)

# One weight per row; rows with larger weights are more likely to be drawn
w = pds.Series(data=[0.05, 0.05, 0.05,
                     0.05, 0.05, 0.1,
                     0.15, 0.15, 0.15,
                     0.2])

# random_state seeds the random number generator so that
# the same sample is drawn on every run
sampleData = dataFrame.sample(n=5,
                              random_state=5,
                              weights=w)

print("Random sample using weights:")
print(sampleData)

Output:

Random sample using weights:
   Distance  TimeToReach
4        30           40
9        55           70
6        40           50
7        45           60
8        50           65
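The weights can also come from a column of the DataFrame itself, and they do not have to sum to 1, since pandas normalizes them. The following is a minimal sketch under stated assumptions: the Weight column and its values are hypothetical and are not part of the example above. Rows with a weight of zero are never selected.

```python
import pandas as pds

# Hypothetical data; the "Weight" column is an assumption for illustration
dataFrame = pds.DataFrame({"Distance": [10, 15, 20, 25, 30],
                           "Weight": [0, 0, 2, 3, 5]})

# weights names a column of the DataFrame. The values sum to 10, not 1;
# pandas normalizes them before drawing.
sampleData = dataFrame.sample(n=2, weights="Weight", random_state=5)

# Only rows with a non-zero weight (index 2, 3 and 4) can appear
print("Random sample using a weight column:")
print(sampleData)
```

Keeping the weights in a column ties each weight to its row, which avoids maintaining a separate Series whose index has to match the DataFrame.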

Copyright 2024 © pythontic.com